Finding Nemo: a bug journey

Follow-up from having locally installed XLA
Published

December 13, 2020

What happened at first

For the better part of the last few months we were trying to solve one bug that, until these last days, we didn’t know how to solve. Some time before the fastai version 2 release, after having a POC and going through the hackathon we participated in, we set the project aside for a while. But when we came back to it, some strange things were happening: some models were not training and others apparently were. We didn’t understand what had happened; at the time we thought we had introduced an error in our code some way.

The journey

So, these last few days, since I had XLA installed locally and was able to run things, I planned to take a new round at finding our bug, and it was a perfect opportunity to test what having pytorch+xla+fastai_xla_extensions installed locally could do.

So I went through a lot of assumptions to finally find the solution.

  1. After having XLA installed locally, one of the first things I wanted to test was whether I could reproduce the error locally, and I could!
  2. The first assumption was that the error was in our code, but having XLA locally allowed me to test the exact same code while only changing the device to CPU, CUDA or XLA. So I ended up having 2 data loaders and 2 training runs in the same Python file. One main point was that I wrapped PyTorch’s Adam with OptimWrapper and it trained correctly, so I became more suspicious of differences between the fastai optimizer and the native PyTorch one, because there is one known difference about __getstate__ that is also a requirement for TPU Pods.
  3. In the past we had also thought that it was a freeze/unfreeze problem, but that was discarded too, so this time I was checking the optimizer, but I could not find why the params were not training even when looking at it under the lens.
  4. After more testing I saw that the second example in the file started to train correctly while the first one did not, and that held for all the fresh runs, so I thought it was a problem with the Learner. But I could not find a “real problem”, so I returned to the optimizer, this time with a new “tool” I had learned: counting the trainable parameters. Trainable parameters are the ones whose gradients get updated when you call backward, so I started counting there: for the first example the count was always zero, while for the second it had a number from the start (see the sketch after this list). So the next task was to find out why the trainable parameters were always zero for the first run and not for the second.
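
For reference, here is what that counting trick looks like in plain PyTorch (a minimal sketch; the Linear model and MSE loss are just placeholders, not our actual code): the count of parameters with a populated .grad only becomes non-zero once backward has actually reached them.

import torch
import torch.nn as nn
import torch.nn.functional as F

def count_trainable_with_grad(model):
    # Trainable parameters that have actually received a gradient from backward
    return sum(1 for p in model.parameters() if p.requires_grad and p.grad is not None)

model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randn(8, 2)

print(count_trainable_with_grad(model))   # 0: backward has not run yet

loss = F.mse_loss(model(x), y)
loss.backward()

print(count_trainable_with_grad(model))   # 2: weight and bias now have gradients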

But I still didn’t get why one model was training and the other one was not!

I went through _BaseOptimizer, Optimizer, Learner and others and still could not find the problem, so I decided to compare the models, and that is where I found it! I updated an example I found in the PyTorch forums, https://discuss.pytorch.org/t/two-models-with-same-weights-different-results/8918/7. The original one broke on the first run because it compared tensors that were not on the same device, so it threw an error; I modified it so that it prints the mismatches nicely instead of getting caught in that error.

import torch

def compare_models(model_1, model_2):
    # Walk both state_dicts in parallel and report any tensors that differ,
    # printing the devices when the difference is a device mismatch.
    models_differ = 0
    for key_item_1, key_item_2 in zip(model_1.state_dict().items(), model_2.state_dict().items()):
        if key_item_1[1].device == key_item_2[1].device and torch.equal(key_item_1[1], key_item_2[1]):
            pass
        else:
            models_differ += 1
            if key_item_1[0] == key_item_2[0]:
                _device = f'device {key_item_1[1].device}, {key_item_2[1].device}' if key_item_1[1].device != key_item_2[1].device else ''
                print(f'Mismatch {_device} found at', key_item_1[0])
            else:
                raise Exception('state_dict keys do not line up')
    if models_differ == 0:
        print('Models match perfectly! :)')
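
For reference, this is how the helper behaves on a trivial pair of models (the tiny Linear model is purely illustrative); with a CUDA or XLA device around, moving one copy reproduces exactly the kind of mismatch the bug produced:

import copy
import torch
import torch.nn as nn

m1 = nn.Linear(4, 2)
m2 = copy.deepcopy(m1)

compare_models(m1, m2)                   # Models match perfectly! :)

# Simulate the bug: same weights, but one copy living on another device
# (CUDA is used here if present; an XLA device behaves the same way).
if torch.cuda.is_available():
    compare_models(m1, m2.to('cuda'))    # Mismatch device cpu, cuda:0 found at weight ...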

And that was the solution to the problem: I focused on seeing why the models’ parameters ended up on different devices. In the end I had something like this (remember I don’t need to patch the optimizer because I have everything installed locally):

# fastai's Learner.create_opt, with debug prints added for the trainable-parameter count
def create_opt(self):
    print('trainable count before', len(self.all_params(with_grad=True)))
    self.opt = self.opt_func(self.splitter(self.model), lr=self.lr)
    print('trainable count after', len(self.all_params(with_grad=True)))
    if not self.wd_bn_bias:
        for p in self._bn_bias_state(True ): p['do_wd'] = False
    if self.train_bn:
        for p in self._bn_bias_state(False): p['force_train'] = True

and

        # debug prints around the backward call in the training loop
        print('trainable count before backward', len(self.all_params(with_grad=True)))
        self('before_backward')
        print('trainable count after before_backward', len(self.all_params(with_grad=True)))
        self._backward()
        self('after_backward')

So in the end I saw that even though the model is moved to the device later, at the moment splitter=trainable_params runs in self.opt = self.opt_func(self.splitter(self.model), lr=self.lr) inside create_opt, the model is not there yet, so the optimizer’s parameters were stuck on CPU while the data and the model were later moved to the XLA device.
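
In plain PyTorch terms this is the usual ordering rule: move the model to the device before handing its parameters to the optimizer. A minimal sketch of both orderings, assuming torch_xla is installed (the Linear model and SGD are just placeholders):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

model = nn.Linear(10, 2)
device = xm.xla_device()

# What the bug amounted to: the optimizer grabs the CPU parameter tensors,
# while training later runs on the copies living on the XLA device.
#   opt = torch.optim.SGD(model.parameters(), lr=0.1)
#   model = model.to(device)

# The fix: move the model to the XLA device first, then build the optimizer,
# so it holds references to the parameters that will actually be trained.
model = model.to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)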

This does not affect GPUs, but thinking about it, it could also say something about the picklable behaviour of XLA tensors, especially the optimizer, but that is a story for another time. Right now, we again have a simple lib that works for single-device TPUs where you need to modify zero code from fastai.

Conclusion

So the model needs to be on the TPU/XLA device before its parameters are taken by the splitter at Optimizer initialization; I guess we assumed some things in between then and now. In the end it was not exactly an error, but it behaved like one, and it sure was difficult to track. Now that we know what it is, it is solved and we can continue forward.

I hope to add in the next release a show_lowering_ops (or similar) to print the counters, so if you have hit some of those ops it is easy to print them for a model that runs with this activated. The MNIST demo should be working again; don’t forget to peek at fastai_xla_extensions.
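
As a rough sketch of what such a helper could look like (show_lowering_ops is only the planned name, not an existing API), torch_xla already exposes those counters through torch_xla.debug.metrics:

import torch_xla.debug.metrics as met

def show_lowering_ops():
    # Counters named "aten::*" mark ops that were not lowered to XLA and fell
    # back to CPU execution, which is usually what you want to hunt down.
    for name in met.counter_names():
        if name.startswith('aten::'):
            print(name, met.counter_value(name))
    # The full report also includes compile/execute metrics:
    # print(met.metrics_report())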

EXTRA NOTE: But was the error really that the model stuck on CPU was not trained while the backward updates came from data operations on the XLA device? Well, now I think XLA worked on the TPU, with the model and data copied to the TPU the first time, but somehow our model got stuck on CPU and so was never trained; it became another model, separate from the execution happening on the TPU (I can think of pickling-related things, but that is unknown at the moment).